2025-12-05

Motivation

Non-linear regression

Given \(n\) data points \(\{(x_i, y_i)\}_{i=1}^n \xrightarrow{\text{Predict}} (x_{\texttt{new}}, y_{\texttt{new}})\)

Can all be done with Gaussian distributions!

Review of the Multivariate Normal (MVN) in 2-d

If we have a vector of random variables \(\mathbf{x}\) and

\[ \mathbf{x} \sim \mathcal{N}_d(\boldsymbol{\mu}, \mathbf{\Sigma}) \]

then the joint probability density of \(\mathbf{x}\) is given by the multivariate normal:

\[ p\left( \mathbf{x} \,|\, \boldsymbol{\mu}, \mathbf{\Sigma} \right) \propto \exp \left\{ -\frac{1}{2} (\mathbf{x}-\boldsymbol{\mu})^\top \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu}) \right\} \]

Review of the Multivariate Normal (MVN) in 2-d

A two-dimensional MVN with \(\boldsymbol{\mu}=[0,0]\) and \(\Sigma=\begin{bmatrix} 1 & 0.7\\ 0.7 & 1 \end{bmatrix}\):

# requires the mvtnorm package for dmvnorm()
library(mvtnorm)

# mean vector
mu <- c(0, 0)

# covariance matrix
Sigma <- matrix(c(1, 0.7,
                  0.7, 1), nrow = 2)

# grid of x1 and x2 values
x1x2_grid <- expand.grid(x1 = seq(-3, 3, length.out = 100),
                         x2 = seq(-3, 3, length.out = 100))

# density at every grid point, used for the probability contours
probabilities <- dmvnorm(as.matrix(x1x2_grid), mean = mu, sigma = Sigma)

Review of the Multivariate Normal (MVN) in 2-d

Probability contour plot:

Review of the Multivariate Normal (MVN) in 2-d

MVN pdf in 3 dimensions:

Conditioning of the MVN in 2-d

We can condition on one of the variables, \(p(x_2 \,|\, x_1,\,\Sigma)\)
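
Using the standard conditional-MVN formulas, this can be computed in a few lines of base R. A minimal sketch, assuming the example \(\boldsymbol{\mu}\) and \(\Sigma\) above and a hypothetical observed value of \(x_1\):

```r
# mean vector and covariance matrix from the 2-d example above
mu    <- c(0, 0)
Sigma <- matrix(c(1, 0.7,
                  0.7, 1), nrow = 2)

x1_obs <- 1.5  # hypothetical observed value of x1

# conditional mean: mu2 + Sigma21 * Sigma11^{-1} * (x1 - mu1)
cond_mean <- mu[2] + Sigma[2, 1] / Sigma[1, 1] * (x1_obs - mu[1])
# conditional variance: Sigma22 - Sigma21 * Sigma11^{-1} * Sigma12
cond_var  <- Sigma[2, 2] - Sigma[2, 1]^2 / Sigma[1, 1]

cond_mean  # 1.05
cond_var   # 0.51
```

Observing \(x_1\) shifts the mean of \(x_2\) toward the observation and shrinks its variance below the prior value of 1.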

Gaussian Processes

MVN in multiple dimensions

\[ \mathbf{\Sigma} = \begin{bmatrix} 1 & 0.7\\ 0.7 & 1 \end{bmatrix} \]

\(\Sigma\) matrix in 10 dimensions

\[ \mathbf{\Sigma} = \begin{bmatrix} 1.00 & 0.90 & 0.67 & 0.41 & 0.20 & 0.08 & 0.03 & 0.01 & 0.00 & 0.00 \\ 0.90 & 1.00 & 0.90 & 0.67 & 0.41 & 0.20 & 0.08 & 0.03 & 0.01 & 0.00 \\ 0.67 & 0.90 & 1.00 & 0.90 & 0.67 & 0.41 & 0.20 & 0.08 & 0.03 & 0.01 \\ 0.41 & 0.67 & 0.90 & 1.00 & 0.90 & 0.67 & 0.41 & 0.20 & 0.08 & 0.03 \\ 0.20 & 0.41 & 0.67 & 0.90 & 1.00 & 0.90 & 0.67 & 0.41 & 0.20 & 0.08 \\ 0.08 & 0.20 & 0.41 & 0.67 & 0.90 & 1.00 & 0.90 & 0.67 & 0.41 & 0.20 \\ 0.03 & 0.08 & 0.20 & 0.41 & 0.67 & 0.90 & 1.00 & 0.90 & 0.67 & 0.41 \\ 0.01 & 0.03 & 0.08 & 0.20 & 0.41 & 0.67 & 0.90 & 1.00 & 0.90 & 0.67 \\ 0.00 & 0.01 & 0.03 & 0.08 & 0.20 & 0.41 & 0.67 & 0.90 & 1.00 & 0.90 \\ 0.00 & 0.00 & 0.01 & 0.03 & 0.08 & 0.20 & 0.41 & 0.67 & 0.90 & 1.00 \end{bmatrix} \]

Fixing points with the conditional distribution

As a simple example, imagine we are given three data points \(\{(x_i, y_i)\}_{i=1}^3\) and want to predict the function at other inputs.

Mathematical formalism

Definition of Gaussian Processes (GPs)

A Gaussian process is a generalization of a multivariate Gaussian distribution to infinitely many variables.

A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution.

As a distribution over functions, a Gaussian process is completely specified by two functions:

  • A mean function, \(m(\mathbf{x}) = \mathbb{E}[f(\mathbf{x})]\) and
  • A covariance function, \(k(\mathbf{x}, \mathbf{x^\star}) = \mathbb{E}\Big[\big(f(\mathbf{x})- m(\mathbf{x})\big) \big(f(\mathbf{x^\star})- m(\mathbf{x^\star})\big) \Big]\)

\[ f(\mathbf{x}) \sim \mathcal{GP}\big(\, m(\mathbf{x}), k(\mathbf{x}, \mathbf{x^\star}) \, \big) \]

Regression

Generative model

\[ y(\mathbf{x}) = f(\mathbf{x}) \Big[ + \epsilon\sigma_y \Big], \qquad \epsilon \sim \mathcal{N}(0,1) \]

Place GP prior over the nonlinear function (mean function often taken as 0).

\[ p(f(\mathbf{x}) \, | \, \theta) = \mathcal{GP}\big(0, k(\mathbf{x}, \mathbf{x^\star})\big)\\ k(\mathbf{x}, \mathbf{x^\star}) = \sigma^2 \exp \left\{ -\frac{1}{2\ell^2}(x-x^\star)^2 \right\} \]
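
The squared-exponential covariance above is easy to implement directly. A minimal sketch in base R (the hyperparameter defaults \(\sigma^2 = 1\) and \(\ell = 1\) are arbitrary):

```r
# squared-exponential covariance between two sets of scalar inputs
se_kernel <- function(x, xstar, sigma2 = 1, ell = 1) {
  sigma2 * exp(-0.5 * outer(x, xstar, "-")^2 / ell^2)
}

x <- c(0, 1, 2)
K <- se_kernel(x, x)
# diag(K) equals sigma2; off-diagonal entries decay with squared distance
```

Nearby inputs get strongly correlated function values, which is exactly what produced the banded \(\Sigma\) shown in the 10-dimensional example earlier.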

Predictions

\[ p(\mathbf{y}_1, \mathbf{y}_2) = \mathcal{N} \left( \begin{bmatrix} \boldsymbol{\mu}_1\\ \boldsymbol{\mu}_2 \end{bmatrix}, \begin{bmatrix} \mathbf{\Sigma}_{11} & \mathbf{\Sigma}_{12}\\ \mathbf{\Sigma}_{21} & \mathbf{\Sigma}_{22} \end{bmatrix}\right) \]

Conditioning on \(\mathbf{y}_2\):

\[ p(\mathbf{y}_1 \,|\, \mathbf{y}_2) = \frac{p(\mathbf{y}_1, \mathbf{y}_2)}{p(\mathbf{y}_2)} \]

With an involved proof:

  • Predictive mean (linear projection in inverse distance space): \(\boldsymbol{\mu}_{\mathbf{y}_1|\mathbf{y}_2} = \boldsymbol{\mu_1} + \mathbf{\Sigma}_{12}\mathbf{\Sigma}_{22}^{-1}(\mathbf{y}_2-\boldsymbol{\mu}_2)\)
  • Predictive variance (quadratic reduction in variance): \(\mathbf{\Sigma}_{\mathbf{y}_1|\mathbf{y}_2} = \mathbf{\Sigma}_{11} - \mathbf{\Sigma}_{12}\mathbf{\Sigma}_{22}^{-1}\mathbf{\Sigma}_{21}\)
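
Together, these two formulas are all of GP prediction. A minimal sketch in base R, assuming a zero prior mean, a squared-exponential kernel, and made-up training data:

```r
# squared-exponential covariance (illustrative hyperparameters)
se_kernel <- function(x, xstar, sigma2 = 1, ell = 1) {
  sigma2 * exp(-0.5 * outer(x, xstar, "-")^2 / ell^2)
}

x_train <- c(-2, 0, 1)           # observed inputs   (the y2 block)
y_train <- sin(x_train)          # observed outputs
x_test  <- seq(-3, 3, by = 0.5)  # prediction inputs (the y1 block)

K22 <- se_kernel(x_train, x_train) + 1e-8 * diag(length(x_train))  # jitter
K12 <- se_kernel(x_test, x_train)
K11 <- se_kernel(x_test, x_test)

# predictive mean:     Sigma12 %*% Sigma22^{-1} %*% y2  (prior means are 0)
post_mean <- K12 %*% solve(K22, y_train)
# predictive variance: Sigma11 - Sigma12 %*% Sigma22^{-1} %*% Sigma21
post_cov  <- K11 - K12 %*% solve(K22, t(K12))
```

At the training inputs the predictive mean interpolates the data and the predictive variance collapses toward zero; far from the data, both revert to the prior.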

Final visualization

Initial conditions

Mean prediction

Mean prediction + 100 function draws

Add a point further away

Optimization using Gaussian Processes (Bayesian Optimization)

Using Gaussian Processes for optimization

  • Premise: We have an unknown and expensive-to-evaluate objective function, \(y(x)\)
  • Goal: Return a value \(x_M\) that corresponds to a minimum \(y(x_M)\)
  • High evaluation cost requires selecting the location of new evaluations carefully

Bayesian Optimization simplified algorithm

In Bayesian optimization, an acquisition function, computed from a surrogate model of the objective, is used to select the next best evaluation location.

Steps:

  1. Evaluate the objective function at a point.
  2. Fit/update Gaussian process model with the data.
  3. Optimize an acquisition function to select the next sampling point.
  4. Repeat until a defined convergence or exhaustion of resources.
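
The four steps can be sketched end to end in base R. Everything below is illustrative rather than Osborne et al.'s exact method: a toy objective, a fixed squared-exponential surrogate, grid search over candidates, and expected improvement standing in for the acquisition function:

```r
objective <- function(x) (x - 0.3)^2 + 0.1 * sin(15 * x)  # toy objective

se_kernel <- function(x, xs, s2 = 1, ell = 0.2)
  s2 * exp(-0.5 * outer(x, xs, "-")^2 / ell^2)

# GP predictive mean and variance (zero prior mean, unit prior variance)
gp_predict <- function(x_tr, y_tr, x_te) {
  K  <- se_kernel(x_tr, x_tr) + 1e-6 * diag(length(x_tr))  # jitter
  Ks <- se_kernel(x_te, x_tr)
  list(mean = as.vector(Ks %*% solve(K, y_tr)),
       var  = pmax(1 - diag(Ks %*% solve(K, t(Ks))), 1e-12))
}

grid <- seq(0, 1, length.out = 201)  # candidate evaluation points
x_tr <- c(0, 1)                      # step 1: initial evaluations
y_tr <- objective(x_tr)

for (iter in 1:10) {
  pred <- gp_predict(x_tr, y_tr, grid)   # step 2: fit/update the GP
  eta  <- min(y_tr)                      # incumbent best value
  s    <- sqrt(pred$var)
  z    <- (eta - pred$mean) / s
  ei   <- (eta - pred$mean) * pnorm(z) + s * dnorm(z)  # step 3: acquisition
  x_next <- grid[which.max(ei)]
  x_tr <- c(x_tr, x_next)                # step 4: evaluate and repeat
  y_tr <- c(y_tr, objective(x_next))
}
x_best <- x_tr[which.min(y_tr)]
```

A fixed iteration count stands in for a real convergence check; in practice the kernel hyperparameters would also be re-estimated at each step.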

Example of Bayesian Optimization in action

Exploration versus Exploitation

Explore points with high variance.

Exploit points most likely to be minimum.

Notion of improvement

  • We are minimizing our objective function \(y(x)\).
  • Call the best solution so far \(x_M^\star\).
  • Improvement can be defined as:

\[ I(x_{\texttt{next}}) = \max\left\{0, f(x_M^\star) - \widetilde{f}(x_{\texttt{next}}) \right\} \]

Each point \(x_{\texttt{next}}\) in a Gaussian process is associated with a mean, given by evaluating the mean function at \(x_{\texttt{next}}\), \(\mu(x_{\texttt{next}})\), and a variance, given by evaluating the covariance function at \(x_{\texttt{next}}\), \(k(x_{\texttt{next}}, x_{\texttt{next}})\).
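
In code, improvement is just a clipped difference. A tiny illustrative helper (the numeric values are made up):

```r
# improvement of a predicted value f_tilde over the incumbent best f_best
improvement <- function(f_tilde, f_best) max(0, f_best - f_tilde)

improvement(0.2, 0.5)  # 0.3: candidate predicted to beat the incumbent
improvement(0.8, 0.5)  # 0:   no improvement expected
```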

Osborne et al.: One-step Lookahead

  • Function evaluations so far: \((\mathbf{x}_0, \mathbf{y}_0)\)
  • Define \(\eta := \min \mathbf{y}_0\)
  • Define loss function:

\[ \lambda(y) := \begin{cases} y, & y < \eta \\ \eta, & y \ge \eta \end{cases} \]

Fancy way of saying loss is \(\min(y, \eta)\).

Osborne et al.: Expected loss

\[ \mathbb{E}[\lambda(y)] := \int_{y=-\infty}^{y=\infty} \lambda(y)\cdot p(y \,| \,x, I_0) \, dy \]

where \(x\) is the new selected point and \(I_0\) is the current “knowledge” (e.g., Gaussian process parameters).

Our integral must be split at the critical value \(\eta\). When \(y<\eta\), \(\lambda(y)=y\) and when \(y\ge\eta\), \(\lambda(y)=\eta\).

\[ \mathbb{E}[\lambda(y)] := \int_{y=-\infty}^{y=\eta} y\cdot p(y \,| \,x, I_0) \, dy + \int_{y=\eta}^{y=\infty} \eta\cdot p(y \,| \,x, I_0) \, dy \]

Osborne et al.: Expected loss

Note that \(p(y \,| \,x, I_0)\) is driven by the Gaussian process!

\[ p(y \,| \,x, I_0) = \mathcal{N}\big(y; m(y\,|\,I_0), k(y_0,y\,|\,I_0)\big) \]

Plugging this into the integrals, we get

\[ \begin{multline} \mathbb{E}[\lambda(y)] = \int_{-\infty}^{\eta} y\cdot \mathcal{N}\big(y; m(y\,|\,I_0), k(y_0,y\,|\,I_0)\big) \, dy \\ + \int_{\eta}^{\infty} \eta\cdot \mathcal{N}\big(y; m(y\,|\,I_0), k(y_0,y\,|\,I_0)\big) \, dy \end{multline} \]

\[ \begin{multline} =\eta + (m(y\,|\,I_0) - \eta) \Phi(\eta\,|\,m(y\,|\,I_0),k(y_0,y\,|\,I_0)) \\ - k(y_0,y\,|\,I_0)\phi(\eta\,|\,m(y\,|\,I_0),k(y_0,y\,|\,I_0) ) \end{multline} \]
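
The closed form can be sanity-checked against the defining integral in base R (the predictive mean, variance, and \(\eta\) below are arbitrary made-up values):

```r
# closed-form expected loss; m = predictive mean, v = predictive variance
expected_loss <- function(m, v, eta) {
  s <- sqrt(v)
  eta + (m - eta) * pnorm(eta, mean = m, sd = s) -
    v * dnorm(eta, mean = m, sd = s)
}

m <- 0.4; v <- 0.25; eta <- 0.1  # arbitrary illustrative values

# numerical evaluation of E[min(y, eta)] under N(m, v)
numeric_val <- integrate(function(y) pmin(y, eta) * dnorm(y, m, sqrt(v)),
                         -Inf, Inf)$value
closed_val  <- expected_loss(m, v, eta)
# closed_val and numeric_val agree to integration tolerance
```

Note that the \(\phi\) term is multiplied by the variance \(k(y_0,y\,|\,I_0)\) because \(\phi\) here is the Gaussian density with that variance, not the standard normal density.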

Figure 1: Expected loss in action

Derivative observations, avoiding conditioning problems

  • Strongly correlated observations makes for poorly conditioned covariance matrices
  • Derivative observations can be seamlessly incorporated into a GP
  • A function over which we have a GP is jointly Gaussian with its derivatives
  • Derivative observations can be treated just as observations of the function itself (with modified covariance function)
  • A derivative is only weakly correlated with the function very nearby, improving conditioning

Results of GPGO

GPGO was used to find the optima for several known, difficult functions.

Performance was measured using the “gap”:

\[ G := \frac{y(x^{\texttt{first}}) - y(x^{\texttt{best}})}{y(x^{\texttt{first}}) - y^{\texttt{opt}}} \]

  • \(y^{\texttt{opt}}\) is the known global optimum
  • \(x^{\texttt{first}}\) and \(x^{\texttt{best}}\) are the first point evaluated and best point found
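
As a quick illustration of the metric (values are made up), a gap of 1 means the optimum was found and 0 means no progress past the first evaluation:

```r
# gap metric: fraction of the distance from the first evaluation to the
# known optimum that the optimizer closed
gap <- function(y_first, y_best, y_opt) (y_first - y_best) / (y_first - y_opt)

gap(10, 2, 0)   # 0.8: closed 80% of the distance to the optimum
gap(10, 10, 0)  # 0:   no improvement over the first evaluation
```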

Results of GPGO

Limitations and future work

  • The work does not explore optimization problems on which the new GPGO method might struggle or fail
  • Explore reasons for the lower performance of GPGO compared to other methods when optimizing certain functions such as Shu or A5
  • Plan to apply BO methods to computationally intensive objective functions in clinical trial design

Credit

Journal article: Osborne et al. (2009)

Mathematics/derivations: Williams and Rasmussen (2006), Osborne et al. (2009)

Inspiration for some of the visualizations: Turner (2016), Gramacy (2022a), and Gramacy (2022b)

Bayesian optimization visualization: Farina (2024)

References

Farina, Gabriele. 2024. “Lecture 16 - Bayesian Optimization.” Massachusetts Institute of Technology (MIT), April 25. https://www.mit.edu/~gfarina/2024/67220s24_L16_bayesian_optimization/L16.pdf.

Gramacy, Robert B. 2022a. “Surrogate Modeling and Bayesian Optimization.” Virginia Tech, September 21. https://www.youtube.com/watch?v=0Jj9ikDxqf4.

Gramacy, Robert B. 2022b. “Surrogate Modeling and Bayesian Optimization (Part 2).” Virginia Tech, September 21. https://www.youtube.com/watch?v=JKp32HVxIQs.

Osborne, Michael A., Roman Garnett, and Stephen J. Roberts. 2009. “Gaussian Processes for Global Optimization.” In 3rd International Conference on Learning and Intelligent Optimization (LION3).

Turner, Richard. 2016. “Machine Learning Tutorial Series: Gaussian Processes.” Imperial College London, November 23. https://www.youtube.com/watch?v=92-98SYOdlY.

Williams, Christopher K. I., and Carl Edward Rasmussen. 2006. Gaussian Processes for Machine Learning. Vol. 2. Cambridge, MA: MIT Press.